Trace validation test #204

joshuazh-x · 2024-06-13T11:31:06Z

Following up on previous discussion with @serathius and @ahrtr, I make some improvements to the Raft trace validation test:

Refactored Test Case: Combine the random trace generation and trace validation script into a single test case for better readability and maintainability.
Simplified Execution: The test can now be run with a simple make trace-validation command.
Enhanced Customization: The test supports customization of the working directory, number of parallel jobs, and the amount of traces to be generated (see README file).
Other improvements: An additional bootstrap trace has been included to simplify the TLA+ specification.

These changes aim to streamline the trace validation process and make it more efficient to automate testing.

Please let me know if you have any questions or feedback.

serathius · 2024-06-19T08:47:57Z

Please fix the CI failures to ensure timely review.

Also it would be best to CC everyone that was involved with the previous PR

joshuazh-x · 2024-07-03T08:08:46Z

cc @serathius @ahrtr , here is the trace validation test with some additional improvements in TLA+ spec. The idea here is to set up a sample application based on raft, running with various faults injected. The produced traces will then be collected and validated by TLA+ trace validation. Just run make trace-validation in raft root folder.

1. Improve etcdraft spec to provide hooks for model-checking and trace validation. 2. Add new sample application based on raft and capable of injecting various faults. 3. Add new test to generate random traces (with faults) and run trace validation on them. 4. Fix some issues in etcdraft spec. Signed-off-by: Joshua Zhang <[email protected]>

serathius · 2024-07-04T07:13:44Z

CCing everyone involved in #113
/cc @xiang90 @pav-kv @ahrtr @zeu5 @lemmy @siyuanfoundation @MadhavJivrajani @andreo @felipecrv

Your input on this PR would be a huge help! The successful integration of TLA+ has the potential to significantly improve the reliability of Raft and etcd. I'd be grateful for any insights you can share.

k8s-ci-robot · 2024-07-04T07:13:49Z

@serathius: GitHub didn't allow me to request PR reviews from the following users: andreo, felipecrv, xiang90, zeu5, lemmy.

Note that only etcd-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

CCing everyone involved in #113
/cc @xiang90 @pav-kv @ahrtr @zeu5 @lemmy @siyuanfoundation @MadhavJivrajani @andreo @felipecrv

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

zeu5 · 2024-07-06T06:00:21Z

A few changes to the TLA spec are unclear for me. I will summarise my understanding and list the concerns below.

Summary

Two new variables are added applied and durableState.applied. The two variables represent the state in memory and that persisted to the log. durableState.applied is updated on PersistState action and applied is updated on Advance. The rest of the changes in TLA revolve around ensuring sound updates to these two variables.

In the code, a new trace event Advance is introduced to ensure updates to applied variable in TLA.

Clarifications

Why do we need both the variables? In the code, the only applied state tracked is in the log.
Whenever ready is requested, the log.applied is updated. This is inconsistent with the TLA model which requires that applied is updated only after the corresponding Ready is advanced. Is there a violation found corresponding to this inconsistency?
I do not see an event for PersistState in the trace logger. So the corresponding action is never executed?

Minor comments

This is not related to the changes in this PR but unclear to me - when a node is stopped (as a failure introduced makeNodesDown) it doesn't update any variables and is stopped. In the TLA model however, this is never tracked and updates are received to the node. So the state in the code and in the model will be out of sync? Shouldn't the model track active nodes?

joshuazh-x · 2024-07-07T09:00:16Z

Thanks @zeu5 for the summary.

Please allow me to clarify some of my thoughts when I made this change. Please correct me if I made any mistake here.

Why do we need both the variables? In the code, the only applied state tracked is in the log.
Whenever ready is requested, the log.applied is updated. This is inconsistent with the TLA model which requires that applied is updated only after the corresponding Ready is advanced. Is there a violation found corresponding to this inconsistency?

To my understanding, log.applied is updated upon receiving MsgStorageApplyResp, indicating successful applying to the state machine. log.applying ( I assume you meant this) is updated upon ready and serves as an optimization to allow issuing new ready without waiting for complete processing of the previous one (relevant with async storage writes). Since I prioritize sync storage write in this spec so log.applied is used here for determining entries in ready to simplify the spec, Adjustments will be made when the spec covers async storage writes.

I do not see an event for PersistState in the trace logger. So the corresponding action is never executed?

PersistState isn't traced as it's external to the Raft module, but it executes as part of Next. Initially, I aimed for non-determinism (allowing PersistState and SendMessags anywhere in the ready phase), but for TLC efficiency, I eventually choose a constrained behavior enforces sequential execution of PersistState and SendMessages after Ready. So whenever a traced Ready is processed, the behavior Ready->PersistState->Ready will be enforced.

Shouldn't the model track active nodes?

The TLA+ spec focuses on state transitions. A non-functional node implies no state changes or message exchange, effectively modeled by behaviors excluding actions related to that node. The state space exploration will include behaviors representing such node downtime scenarios.

zeu5

Thank you for the clarification @joshuazh-x I have requested some additional minor changes.

For further updates, I suppose it would also be good to include as part of the merge workflow some model checking on the updated TLA models to ensure the correctness of the updates. @serathius please consider this.

tla/validate-model.sh

tla/etcdraft.tla

tla/MCetcdraft.cfg

tla/etcdraft.tla

tla/etcdraft2.tla

1. Remove type annotations which are not needed here. 2. Add more comments for the actions. 3. Fix simulation argument in valide-model.sh. 4. Let validate-model.sh use all cores to improve performance. Signed-off-by: Joshua Zhang <[email protected]>

1. Decouple SpecAction and relevant definitions from the core etcd raft spec. Put them into a middle layer spec etcdraft_control.tla 2. Fix racing issue in trace_validation_test.go. Signed-off-by: Joshua Zhang <[email protected]>

joshuazh-x · 2024-07-08T06:48:44Z

I'll keep each fix in a separated commit during the review. If everything goes well in the end, I'll combine them into one before final merge.

joshuazh-x · 2024-08-13T02:39:26Z

/cc @serathius @ahrtr @zeu5 @lemmy any feedbacks for this pr?

k8s-ci-robot · 2024-08-13T14:25:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshuazh-x, lemmy
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahrtr · 2024-08-13T15:31:08Z

Please try not to submit huge PR. Please try to breakdown this PR as much as you can,

I see you removed the huge file tla/example.ndjson, pls do it in a separate PR and ensure the readme.md is also updated accordingly.
get the TLA+ changes in a separate PR, and leave it to TLA+ experts to review it.
get the test included in a separate PR. I think that it makes more sense to integrate the trace test with etcd's robustness test. We can discuss that after above two items are resolved.

k8s-ci-robot added the size/XXL label Jun 13, 2024

joshuazh-x marked this pull request as draft June 13, 2024 12:32

k8s-ci-robot added the do-not-merge/work-in-progress label Jun 13, 2024

joshuazh-x force-pushed the trace-validation-test-2 branch from 3b47a57 to 33c203e Compare July 3, 2024 07:57

joshuazh-x marked this pull request as ready for review July 3, 2024 07:59

k8s-ci-robot removed the do-not-merge/work-in-progress label Jul 3, 2024

joshuazh-x force-pushed the trace-validation-test-2 branch from 33c203e to 9cee7cd Compare July 4, 2024 02:58

joshuazh-x mentioned this pull request Jul 4, 2024

Improve trace validation test experience #203

Open

ahrtr self-requested a review July 4, 2024 05:55

k8s-ci-robot requested review from pav-kv, siyuanfoundation and MadhavJivrajani July 4, 2024 07:13

zeu5 reviewed Jul 7, 2024

View reviewed changes

tla/validate-model.sh Outdated Show resolved Hide resolved

tla/etcdraft.tla Show resolved Hide resolved

tla/etcdraft.tla Show resolved Hide resolved

lemmy reviewed Jul 7, 2024

View reviewed changes

tla/MCetcdraft.cfg Show resolved Hide resolved

lemmy reviewed Jul 8, 2024

View reviewed changes

tla/etcdraft.tla Outdated Show resolved Hide resolved

lemmy reviewed Jul 8, 2024

View reviewed changes

tla/etcdraft.tla Outdated Show resolved Hide resolved

lemmy reviewed Jul 8, 2024

View reviewed changes

tla/etcdraft2.tla Outdated Show resolved Hide resolved

joshuazh-x added 2 commits July 8, 2024 03:00

Address review feedbacks.

658195a

1. Remove type annotations which are not needed here. 2. Add more comments for the actions. 3. Fix simulation argument in valide-model.sh. 4. Let validate-model.sh use all cores to improve performance. Signed-off-by: Joshua Zhang <[email protected]>

Address review feedbacks.

0701128

1. Decouple SpecAction and relevant definitions from the core etcd raft spec. Put them into a middle layer spec etcdraft_control.tla 2. Fix racing issue in trace_validation_test.go. Signed-off-by: Joshua Zhang <[email protected]>

lemmy approved these changes Aug 13, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trace validation test #204

Trace validation test #204

joshuazh-x commented Jun 13, 2024

serathius commented Jun 19, 2024 •

edited

Loading

joshuazh-x commented Jul 3, 2024

serathius commented Jul 4, 2024 •

edited

Loading

k8s-ci-robot commented Jul 4, 2024

zeu5 commented Jul 6, 2024

joshuazh-x commented Jul 7, 2024

zeu5 left a comment •

edited

Loading

joshuazh-x commented Jul 8, 2024

joshuazh-x commented Aug 13, 2024

k8s-ci-robot commented Aug 13, 2024

ahrtr commented Aug 13, 2024

Trace validation test #204

Are you sure you want to change the base?

Trace validation test #204

Conversation

joshuazh-x commented Jun 13, 2024

serathius commented Jun 19, 2024 • edited Loading

joshuazh-x commented Jul 3, 2024

serathius commented Jul 4, 2024 • edited Loading

k8s-ci-robot commented Jul 4, 2024

zeu5 commented Jul 6, 2024

Summary

Clarifications

Minor comments

joshuazh-x commented Jul 7, 2024

zeu5 left a comment • edited Loading

Choose a reason for hiding this comment

joshuazh-x commented Jul 8, 2024

joshuazh-x commented Aug 13, 2024

k8s-ci-robot commented Aug 13, 2024

ahrtr commented Aug 13, 2024

serathius commented Jun 19, 2024 •

edited

Loading

serathius commented Jul 4, 2024 •

edited

Loading

zeu5 left a comment •

edited

Loading